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Abstract 

In this paper we present two algorithms for the 
multiplication of sparse Laurent polynomials and 
Poisson series (the latter being algebraic struc- 
tures commonly arising in Celestial Mechanics from 
the application of perturbation theories). Both al- 
gorithms first employ the Kronecker substitution 
technique to reduce multivariate multiplication to 
univariate multiplication, and then use the school- 
book method to perform the univariate multiplica- 
tion. The first algorithm, suitable for moderately- 
sparse multiplication, uses the exponents of the 
monomials resulting from the univariate multi- 
plication as trivial hash values in a one dimen- 
sional lookup array of coefficients. The second al- 
gorithm, suitable for highly-sparse multiplication, 
uses a cache-optimised hash table which stores the 
coefficient-exponent pairs resulting from the mul- 
tiplication using the exponents as keys. 

Both algorithms have been implemented with 
attention to modern computer hardware architec- 
tures. Particular care has been devoted to the effi- 
cient exploitation of contemporary memory hierar- 
chies through cache-blocking techniques and cache- 
friendly term ordering. The first algorithm has 
been parallelised for shared-memory multicore ar- 
chitectures, whereas the second algorithm is in the 
process of being parallelised. 

We present benchmarks comparing our algo- 
rithms to the routines of other computer algebra 
systems, both in sequential and parallel mode. 



1 Introduction 

The application of perturbation theories for non-linear 
differential equations in Celestial Mechanics is a task 
traditionally associated with cumbersome yet simple op- 
erations on basic algebraic structures. Modern appli- 
cations of perturbative methods to the problem of the 
long-term stability of the Solar System, for instance, 
lead to series expansions whose number of terms is in 
the order of 10 5 - 10 6 [23, 14]. It is then not surpris- 
ing that astronomers and celestial mechanicians have 
sought - since the dawn of the digital age - to fully 
exploit the power and reliability of computers in this 
domain, producing a vast amount of literature on spe- 
cialised (as opposed to general-purpose) algebraic ma- 
nipulators [16, 7, 19, 20, 29, 6, 3, 9, 28, 2, 18, 13]. 

Typically, perturbative methods in Celestial Mechan- 
ics require the ability to manipulate symbolically alge- 
braic structures known as Poisson series [30], consisting 
of Fourier series having Laurent polynomials as coeffi- 
cients: 
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Here the symbolic variables are denoted by x and y, 
while i and j are integer indices and C is a numerical co- 
efficient. The mathematical operations to be performed 
on such objects are usually elementary. E.g., the com- 
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Table 1. Kronecker substitution for a 3-variate polyno- 
mial up to the third power in each variable. 

putation of normal forms using the Lie series technique, 
a popular methodology in Celestial Mechanics, requires 
the operations of addition, subtraction, multiplication, 
differentiation and integration [10]. 

The most taxing operation to be performed on Pois- 
son series, which also forms the basis for more com- 
plicated operations such as Taylor expansions, is series 
multiplication. In this paper we present two algorithms 
for the multiplication of sparse Laurent polynomials and 
Poisson series which employ the technique of Kronecker 
substitution and whose implementation seeks to max- 
imise performance on modern computer hardware archi- 
tectures. 

2 Kronecker substitution 

Kronecker substitution, first described in [22], is a 
methodology that allows to reduce multivariate polyno- 
mial multiplication to univariate multiplication, and it 
can be intuitively understood with the aid of a simple 
table laying out the monomials in reverse lexicographic 
order. The representation for three variables x, y and 
z, and up to the third power in each variable is dis- 
played in Table 1. The column of codes is obtained by 
a simple enumeration of the exponents' multiindices. It 
can be noted how in certain cases the addition of multi- 



indices (and hence the multiplication of the correspond- 
ing monomials) maps to the addition of their codified 
representation. For instance: 

c ([0,0, 3]) +c ([0,1,0]) = 3 + 4 = 7 = 

= c([0,l,3])=c([0,0,3] + [0,l,0]), (2) 

where we have noted with c = c (e) the function that 
codifies a multiindex vector of exponents e. An inspec- 
tion of Table 1 promptly suggests that for an m-variate 
polynomial up to exponent n in each variable, c's effect is 
equivalent to a scalar product between the multiindices 
vectors and a constant coding vector c defined as 



c = (n + 1) , (n + 1) 



so that 



c (e) = c • e, 
and eq. (2) can be generalised as 

c ■ (ei + e 2 ) = c ■ ei + c • e 2 . 



(3) 
(4) 

(5) 



While this equation is valid in general due to the dis- 
tributivity of scalar multiplication, the codification of a 
multiindex will produce a unique code only if the multi- 
index is representable within the representation defined 
by c. 

This simple example shows how Kronecker substitu- 
tion constitutes an addition-preserving homomorphism 
between the space of integer vectors whose elements are 
bound in a finite range and a finite subset of integers. 
Since the addition of codes maps to the addition of mul- 
tiindices, the codes can be seen as exponents of a uni- 
variate polynomial. 

For use with Laurent polynomials and Poisson series, 
Kronecker substitution can be conveniently generalised 
in the following way: 

• we can consider variable codification, i.e., each el- 
ement of the multiindex has its own range of vari- 
ability; 

• we can extend the validity of the codification to neg- 
ative integers. 

The first generalisation allows to compact the range 
of the codes. If, for instance, the exponent of variable x 
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in the example above varies only from to 1 (instead of 
varying from to 3 like for y and z) , we can avoid codes 
that we know in advance will be associated to nonexist- 
ing monomials (i.e., all those in which x's exponent is 
either 2 or 3). The second generalisation derives from 
the fact that under an appropriate extension of the cod- 
ing vector it is possible to change e2's sign in eq. (5) 
retaining the validity of the homomorphism, allowing 
thus to apply Kronecker substitution to the multiplica- 
tion of Poisson series. Poisson series multiplication, in- 
deed, requires the ability to deal also with negative expo- 
nents (since by definition Laurent polynomials may have 
terms of negative degree) ; additionally, the trigonomet- 
ric multipliers (i.e., the j indices in eq. (1)) transform 
under multiplication according to the following elemen- 
tary trigonometric formulas, which imply the need to be 
able to subtract vectors of trigonometric multipliers: 



cos a ■ cos = cos (a — f3) + cos (a + f3) , 
cos a ■ sin (3 = sin (a + j3) — sin (a — j3) , 
sin a ■ cos j3 = sin (a — /3) + sin (a + ft) , 
sin a • sin f3 ~ cos (a — /3) — cos (a + f3) . 



(0) 



Table 2 shows the generalised Kronecker substitution for 
a multivariate Laurent polynomial (and Poisson series) 
in which the exponents (and trigonometric multipliers) 
vary on different ranges, possibly assuming negative val- 
ues. If we define: 

e = (e , ei, . . . , e m _i) , (7) 

^min / max (^0,min / maxi ^l,min / max: ' * * ) 

^m— 1, min / max) ) ($) 
I ~t~ ^fc.max ^-A;,min; (9) 
C = (l,Wo,WoWi,W WiW2,..., 

IOO > (io) 

X = c-e min , (11) 

it is easy to show that the code of the generic multiindex 
e is obtained by 



c (e) = c • e — x- 



(12) 



To recap, this generalisation of the Kronecker substi- 
tution technique allows to reduce the multiplication of 
two multivariate Poisson series and Laurent polynomials 
to the multiplication of two univariate polynomials. 



3 The algorithms 

In most applications of practical interest, the univariate 
polynomials resulting from the application of Kronecker 
substitution are sparse. E.g., the Fateman benchmark, 
presented in §5 and described as a dense benchmark in 
[25], after being reduced to a univariate multiplication 
features roughly 1 non-null monomial every 300 in the 
univariate factors. Series arising in the context of Ce- 
lestial Mechanics are usually sparser. Because of this, 
asymptotically fast algorithms for dense multiplication, 
such as FFT and Karatsuba, are not usually employed in 
Celestial Mechanics (see also the discussion in [11]). The 
algorithms described in this paper thus employ school- 
book (aka ordinary) multiplication: each monomial of 
the first univariate polynomial factor is multiplied by all 
the monomials of the second factor. 

Desirable properties of a multiplication algorithm for 
Celestial Mechanics applications include: 

• the ability to operate on multiple types of numeri- 
cal coefficients (e.g., reals, rationals, integers, both 
in machine precision and multiprecision, complex 
numbers, intervals); 

• the ability to efficiently truncate multiplication. 

The second requirement stems from the observation that 
the number of terms of the series involved in many prac- 
tical calculations tends to explode during multiplication, 
if not controlled properly. Typical truncation criterions 
concern quantities such as the absolute value of the nu- 
merical coefficients, the minimum exponents of one or 
more polynomial variables and the order of the Fourier 
harmonics (see the discussion in [26], Chapter 2, Section 
3). 

In any case, in order to truncate efficiently (i.e., with- 
out having to compute a term to discard it afterwards) , 
a truncation criterion may define an ordering over the 
terms of the series being multiplied. This way, while 
multiplying one monomial of the first factor and iterat- 
ing over the monomials of the second factor, it may be 
possible to skip all monomial-by-monomial multiplica- 
tions from a certain point onwards. The multiplication 
algorithm, hence, should ideally be flexible enough to 
allow for this kind of truncation methodology without a 
negative impact on performance. 
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Table 2. Generalised Kronecker substitution for an m-variate Laurent polynomial (or Poisson series) in which each 
exponent ( or trigonometric multiplier) varies on a different range. 



3.1 Moderately-sparse multiplication 

The first algorithm we present is very simple and suit- 
able for moderately-sparse multiplication. The two uni- 
variate polynomial factors are represented as vectors 
of coefficient-exponent pairs (a representation often re- 
ferred to as sparse distributed). The resulting univariate 
polynomial is represented instead as a vector of coeffi- 
cients, with the positional index of each coefficient im- 
plicitly encoding the corresponding exponent (a dense 
distributed representation) . This vector of coefficients is 
initialised with null values. 

Each monomial-by-monomial multiplication generates 
a coefficient, which is added to the coefficient in the out- 
put vector at the positional index equal to the sum of 
the exponents of the monomial factors (see Figure 1). 
In case of Poisson series multiplication, the same coef- 
ficient is also added to the coefficient at the positional 
index equal to the subtraction of the exponents of the 
monomial factors (as per eqs. (6)). The exponents of the 
monomials resulting from the multiplication, in other 
words, are used as trivial (and perfect) hash values for 
accumulation in a lookup array of coefficients. 

The practical performance of this algorithm crucially 
depends on two optimisations related to cache memory: 

• cache blocking: the two polynomial factors are 
subdivided logically into blocks, and, rather than 
iterating over all the monomials of the second fac- 
tor having fixed a monomial in the first one (in a 
doubly- nested for loop fashion), multiplication is 
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Figure 1. Moderately- sparse multiplication: the multi- 
plication of two terms (C n ,n) and (C m ,m) of the uni- 
variate polynomial factors ( where n and m are the ex- 
ponents to which coefficients C' n and C m are associated) 
produces coefficient C n C m , which is stored in the output 
array of coefficients at position n + m. 



performed block-by-block. The effect is the same 
as if the two factors were subdivided into smaller 
polynomials to be multiplied separately; 

• monomial ordering: the two polynomial factors 
are sorted in ascending order according to the de- 
gree of the monomials. 

The combined effect of these optimisations is a cache- 
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friendly memory access pattern which promotes tempo- 
ral and spatial locality of reference through: 

1. sequential writes into the output coefficient vector, 
by virtue of the monomial ordering, 

2. short-term reuse of data already present into the 
cache memory, by virtue of cache blocking. 

A typical memory access pattern for moderately-sparse 
multiplication is displayed in Figure 2. 

These optimisations cope well also when a different 
monomial ordering is required by the truncation crite- 
rion. Indeed, to retain much of the cache-friendly mem- 
ory access pattern, it is enough to first order the mono- 
mials according to the degree and then reorder within 
each block the monomials according to the order re- 
quested by the truncation criterion. 

3.2 Highly-sparse multiplication 

In case of highly-sparse multiplication, the algorithm de- 
scribed above may not be applicable for the following 
reasons: 

• the output coefficient array would occupy much too 
memory storage, 

• the high sparsity would result in a large array 
whose initialisation overhead would outweigh the 
time spent in the actual multiplication. 

Such occurrences are detectable through an analysis of 
the densities of the univariate polynomial factors. 

The algorithm we have adopted to cope with the 
highly-sparse case is based on hashing techniques, and 
it essentially replaces the output coefficient array of the 
first algorithm with a cache-friendly hash table. Our de- 
sign is just one possibility of cache-friendly hash table: 
other designs, including cuckoo hashing [27] and hop- 
scotch hashing [17] (but also linear probing [21]) may be 
effective too. 

The hash table is implemented as a contiguous mem- 
ory area of size n logically subdivided into N adjacent 
buckets of maximum size m = n/N. The buckets store 
coefficient-exponent pairs (i.e., in a sparse distributed 
representation). The exponents of the monomials result- 
ing from the multiplication of the univariate polynomial 
factors are used, after modulo reduction, as hash values 



to accumulate the monomials into the hash table. E.g., 
when a monomial with exponent e is produced during 
multiplication, its hash value h is computed, 

h = e mod N, (13) 

and the monomial is inserted into the h-th bucket (see 
Figure 3). If the bucket already contains a term with the 
same exponent e, then the coefficient of the incoming 
monomial is added to the existing one. Otherwise the 
monomial is appended at the end of the bucket. Since 
the maximum size of the buckets is m, insertion of a 
new monomial can fail when a bucket already contains 
m elements; in this case the size of the table is increased 
and the elements re-hashed. 

In order to reduce the need for resizing, an addi- 
tional "overflow" bucket is allocated. When an insertion 
fails, the monomial is inserted into the overflow bucket, 
and the hash table resize is delayed until the number 
of monomials in the overflow bucket reaches a certain 
threshold s. The overflow bucket must also be checked 
whenever, during a probe of the table, a full bucket is 
encountered. 

The cache memory optimisations introduced for the 
moderately sparse case can be used also for highly sparse 
multiplication. The only modification concerns term or- 
dering: instead of sorting the univariate polynomial fac- 
tors according to the exponents, the terms are sorted ac- 
cording to the exponents modulo N. Elementary proper- 
ties of modular arithmetics ensure then that consecutive 
write operations happen on consecutive buckets (see Fig- 
ure 4). Each time a re- hash operation takes place, the 
polynomial factors must be re-ordered. 

4 Parallelisation 

The first algorithm has been parallelised for shared mem- 
ory multicore architectures using multiple threads of exe- 
cution. The parallelised algorithm assigns to each thread 
a portion of the univariate polynomial factors, and all 
threads concurrently write the results into the same out- 
put coefficient array. The algorithm relies on the cache 
memory optimisations described earlier to avoid con- 
tention. The following example explains how. 

Let us suppose that the univariate polynomial fac- 
tors, Pi and P2 have been divided into blocks. Since the 
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Figure 2. Typical memory access patterns for moderately-sparse multiplication, with (right) and without (left) cache- 
blocking and term ordering optimisation. The x axis displays the time at which the memory write is performed in the 
output coefficient vector, the y axis displays its location within the vector. 



monomials have been sorted according to their exponent, 
the n-th block of a polynomial will contain monomials 
whose exponents are bound in an interval 



(14) 



whereas the block immediately following will feature ex- 
ponents in the interval 



ith 



In+l — [Sn+l,i> e n +l ,/] 



(15) 



(16) 



From elementary interval arithmetics it follows that mul- 
tiplying Pi's block /4 by P^'s block 1^ will produce 
monomials whose exponents will be in the range 



(2) 



+ e n,H e n,f + e n,f 



(17) 



Similarly, multiplying Pi's block 1^2% by Pa's block I^h 
will produce monomials whose exponents will be in the 
range 

F 1 (18) 



(2) 



"n+l,i 



e n+l,i' e n+l,f 



,(2) 



But then, from eq. (16), the following inequality holds: 



-n,f 



,(2) 



(2) 



e n,f ^ e n+l,i "T" e n+l 



(19) 



This implies that the exponents resulting from the mul- 
tiplication of In by In^ and those of I^+i by I^+i wm 
not overlap. Hence, since the exponents encode also the 
memory position in the output coefficient array, this also 
means that performing the multiplications concurrently 
will not cause any contention issue, since the interested 
memory areas are disjoint. 

The parallelised algorithm can now be explained by 
an example. Let us suppose, for the sake of simplicity, 
that there are 4 available threads, that both univariate 
factors have the same length and that the chosen block 
size divides this length exactly, so that each factor is 
consituted of n > 4 blocks. The first 4 blocks of the 
first factor are assigned one per thread (so that thread 
1 is assigned block 1, thread 2 block 2 and so on). Each 
thread starts by multiplying concurrently its block by 
the corresponding block in the second factor (block 1 by 
block 1, 2 by 2, etc.). A condition variable is used as 
a synchronisation barrier which all threads must reach 
after they have completed their block-by-block multipli- 
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Figure 3. Insertion of monomial CgX 9 into a partially-filled hash table with n — 10, N = 2 and m = 5. 




Figure 4. Typical memory access pattern for highly 
sparse multiplication, with cache blocking and term or- 
dering optimisation. The x axis displays the time at 
which the memory write is performed in the hash table, 
the y axis displays the index of the bucket. 



cations. At this point, each thread moves to the next 
block in the second factor, so that at this point thread 
1 is multiplying block 1 of the first factor by block 2 
of the second factor, thread 2 is performing 2 by 3 and 
so on. The same pattern repeats until the last block of 
the second factor is reached, where each thread alter- 
nately sleeps for 3 (= number of threads — 1) iterations. 
These "silent" iterations are needed in order to make 
sure that concurrent operations are performed on con- 
secutive blocks, as explained above. At this point the 
first four blocks of the first factor have been multiplied 



by all the blocks of the second factor; the threads ac- 
quire the next four blocks in the first factor and restart 
the same procedure from the top of the second factor. 
The sequence of block-by-block multiplications is sum- 
marised in Table 3. 

This same algorithm is also applicable, albeit with 
some additional complications, in case of highly-sparse 
multiplication, where the ordering of the term accord- 
ing to the exponent modulo the hash table's size ensures 
concurrent write on disjoint memory areas (apart from 
writes into the overflow bucket, which must be protected 
by locking). We are currently in the process of imple- 
menting this algorithm for the highly-sparse case. 

5 Benchmarks 

The algorithms described in this paper have been imple- 
mented in CH — h within a specialised algebraic manipula- 
tor for Celestial Mechanics called Piranha [4]. For com- 
parison, we quote the results from [25] , where the tested 
programs are SDMP [25], TRIP [13], PARI [1], Maple 
[24] and Magma [5]. Here we limit the comparison to 
SDMP, which provides the best performance according 
to [25] (other timings can be found in [25]). The three 
benchmarks we present are: 

• Fateman's benchmark [11]: calculate / • g, where 
/ = (l + x + y + z + t) 30 and g = f + 1. / and g 
consist of 46376 terms. This benchmark is suitable 
for the first algorithm. 

• ELP Poisson series multiplication: calculate ELP3 3 • 
ELP3 3 , where ELP3 is the Poisson series represent- 
ing the lunar distance in the main problem of the 
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Table 3. Parallel multiplication of the first jour blocks of the first factor Pi by all the blocks of the second factor Pi. 
The "-" symbols represent silent iterations during which no work is performed by the thread, and the vertical lines 
represent thread barriers. Concurrent multiplications follow column-by-column. 



Lunar Theory ELP2000 [8]. ELP3 is a Poisson se- 
ries with numerical coefficients and ELP3 3 consists 
of 60204 terms. This benchmark is suitable for the 
first algorithm. 

• Monagan-Pearce sparse (MP-sparse) benchmark 
[25]: calculate / • g, where 

/= (l + x + y + 2z 2 + M 3 + 5u 5 ) 12 

and 

g = (l + u + t + 2z 2 + 3y 3 + bx b f 2 . 

f and g consist of 6188 terms. This benchmark is 
suitable for the second algorithm. 

The hardware and software configurations on which the 
tests were performed are displayed in Table 4. Regarding 
our Corei7 configuration, we have to remark here that we 
gained access to the machine shortly before submitting 
the paper and thus we were not able to conduct extensive 
testing. We will produce more complete benchmarks on 
this architecture in the future. 

In the results we report both the wall clock timings 
and the clock cycles per monomial-by-monomial multi- 
plication - shortened as "cepm". An "optimal" algo- 
rithm should be able to have a count of cepm close to 
the number of clock cycles needed to multiply and add 
coefficients on a specific architecture, hence minimising 
the bookkeeping overhead and cache misses. 

5.1 Sequential benchmarks 

The results of the benchmarks in single-thread mode are 
summarised in Table 5. Some considerations: 



Test 


Coefficient 


System 


Time 


cepm 


Fateman 


double 


Core2Quad 


4.29s 


4.8 


Fateman 


double 


Core2Duo 


5.62s 


1.6 


Fateman 


double 


PPC64 


4.96s 


4.6 


Fateman 


double 


Xeon 


3.73s 


4.6 


Fateman 


double 


Atom 


20.15s 


15.0 


Fateman 


GMP mpz 


Core2Quad 


67.90s 


75.8 


Fateman 


61-bit integer 


SDMP-Core2 


60.25s 


67.2 


Fateman 


61-bit integer 


SDMP-Corei7 


70.59s 


85.3 


ELP 


double 


Core2Quad 


15.62s 


10.3 


MP-sparse 


double 


Core2Quad 


1.71s 


107.2 


MP-sparse 


double 


Xeon 


1.59s 


110.5 


MP-sparse 


double 


Corei7 


1.15s 


88.0 


MP-sparse 


3 7- bit integer 


SDMP-Core2 


1.86s 


116.6 


MP-sparse 


37-bit integer 


SDMP-Corei7 


1.56s 


108.4 



Table 5. Serial benchmarks. The SDMP results are 
quoted from [25]. 

• we have performed the Fateman benchmark using 
both double precision and multiprecision integer co- 
efficients (the latter implemented using the GMP 
library [15]). For the same test, SDMP is using 61- 
bit integer coefficients as input and 128-bit integer 
coefficients as output. 

• The first algorithm is able to deliver close to optimal 
performance in the Fateman benchmarks. Indeed, 
according to [12], floating-point multiplication on 
most Intel CPUs has a latency of 4 — 5 clock cy- 
cles. The performance degradation on the Atom 
might be explained by the fact the Atom is the only 
CPU, among those tested, that operates in-order 
(thus resulting in less optimisations available to the 
compiler). 



8 



Short name Hardware 



Software 



Core2Quad 

Core2Duo 

PPC64 

Atom 

Xeon 

Corei7 

SDMP-Gore2 
SDMP-Corei7 



Intel Core2 Q6600, 2.4GHz, 

4 cores, 2 x 4MB L2 cache, 4GB DDR2 

Intel Core2 T7100, 1.8GHz, 

2 cores, 2MB L2 cache, 2GB DDR2 

2 x IBM PowerPC 970, 2GHz, 

2 cores, 512KB L2 cache, 8GB DDR2 

Intel Atom N270, 1.6GHz, 

2 cores (Hyper-threading), 512KB L2 Cache, 1GB DDR2 

2 x Intel Xeon X5355, 2.66GHz, 

8 cores, 2 x 4MB L2 cache, 8GB DDR2 

Intel Core i7-940, 2.93GHz, 8 cores (Hyper-threading), 

4 x 256KB L2 cache, 8MB L3 cache, 4GB DDR3 

Intel Core2 Q6600, 2.4GHz, 

4 cores, 2 x 4MB L2 cache, 4GB DDR2 

Intel Core i7-920, 2.66GHz, 4 cores 

4 x 256KB L2 cache, 8MB L3 cache, 6GB DDR3 



Linux 2.6.31.5, 64 bit, 
GCC 4.4.2, CMP 4.3.2 
Linux 2.6.31.2, 64-bit, 
GCC 4.4.2, CMP 4.3.1 
Linux 2.6.32, 64-bit 
GCC 4.4.2, GMP 4.3.1 
Linux 2.6.31.1, 32-bit, 
GCC 4.4.2, GMP 4.3.1 
Linux 2.6.28, 64-bit, 
GCC 4.3.3, GMP 4.2.4 
OSX 10.6, 64-bit, 
Xcode 3.2, GMP 4.3.2 
Linux 2.6.26, 64-bit, 
GCC 4.3.2, GMP 4.2.2 
Linux 2.6.27, 64-bit, 
GCC 4.3.2, GMP 4.2.2 



Table 4. 

[25]. 



Hardware and software configurations used in the benchmarks. The SDMP configurations are quoted from 



• By comparison, Roman Pearce reported to us in 
a personal communication that coefficient multipli- 
cation by SDMP in the Fateman benchmark costs 
around 18 clock cycles. 

• Performance in the ELP benchmark is also close to 
optimal, considering that Poisson series multiplica- 
tion is slightly more complicated than polynomial 
multiplication. 

• In the highly sparse MP-sparse benchmark, there is 
much more algorithmic overhead and effective cache 
memory usage is more difficult to achieve than in 
the Fateman benchmark. Our algorithm seems to 
benefit greatly from the high amount of cache and 
the DDR3 memory available on the Corei7 configu- 
ration. 



5.2 Parallel benchmarks 

The speedups resulting from the parallel Fateman bench- 
mark on various configurations are displayed in Figure 
5. The speedups, calculated with respect to the sequen- 
tial algorithm, are almost linear for Core2Quad, both 
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Figure 5. Speedup for the parallel Fateman benchmark 
on various systems. The data points for PPC64 overlap 
those for Core2Quad, mpz. 
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with double precision and mpz coefficients, and also for 
PPC64. Also on the Atom a ~ 1.5 speedup is measured, 
despite the use of hyper-threading. 

The oddball is clearly the Xeon system, which, de- 
spite exhibiting performance improvements up to all the 
8 cores, features a smaller speedup with respect to the 
other quad-core processors benchmarked. Moreover, the 
results of multiple benchmark runs on this configuration 
resulted in timings proportionally more unstable than 
on the other systems. A possible explanation for this 
behaviour is that the machine is shared among multiple 
users, and, despite we made sure that no other oner- 
ous computations were going on as we performed the 
tests, other unrelated process (daemons, running graph- 
ical sessions, etc.) may have interfered with the bench- 
mark. Another point of interest is that the Xeon sys- 
tem and the PPC64 systems are the only two configura- 
tions featuring two physically separated processors (as 
opposed to the other single-processor multi-core config- 
urations). It may be possible that inter-processor com- 
munication consitutes a bottleneck for the algorithms 
presented here. 



6 Conclusions 

In this paper we have presented two algorithms for 
the multiplication of sparse Laurent polynomials and 
Poisson series on modern hardware architectures. The 
benchmarks performed on various hardware and soft- 
ware configurations suggest that these algorithms are 
competitive, performance-wise, with the fastest algo- 
rithm currently known and implemented in the SDMP 
library. 

Future work will focus on the parallelisation of the 
highly-sparse algorithm. We also need to test the algo- 
rithms on architectures with a higher number of cores, 
in order to verify where the speedup limit lies; further 
tests are also needed to better assess the effectiveness of 
the algorithms on architectures with shared L3 cache, 
such as the Intel Core i7. Finally, we need to investigate 
the suboptimal speedup exhibited by the Xeon system. 
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